Enriching a statistical machine translation system trained on small parallel corpora with rule-based bilingual phrases

نویسندگان

  • Víctor M. Sánchez-Cartagena
  • Felipe Sánchez-Martínez
  • Juan Antonio Pérez-Ortiz
چکیده

In this paper, we present a new hybridisation approach consisting of enriching the phrase table of a phrase-based statistical machine translation system with bilingual phrase pairs matching structural transfer rules and dictionary entries from a shallowtransfer rule-based machine translation system. We have tested this approach on different small parallel corpora scenarios, where pure statistical machine translation systems suffer from data sparseness. The results obtained show an improvement in translation quality, specially when translating out-of-domain texts that are well covered by the shallow-transfer rule-based machine translation system we have used.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Combining Bilingual and Comparable Corpora for Low Resource Machine Translation

Statistical machine translation (SMT) performance suffers when models are trained on only small amounts of parallel data. The learned models typically have both low accuracy (incorrect translations and feature scores) and low coverage (high out-of-vocabulary rates). In this work, we use an additional data resource, comparable corpora, to improve both. Beginning with a small bitext and correspon...

متن کامل

Using RBMT Systems to Produce Bilingual Corpus for SMT

This paper proposes a method using the existing Rule-based Machine Translation (RBMT) system as a black box to produce synthetic bilingual corpus, which will be used as training data for the Statistical Machine Translation (SMT) system. We use the existing RBMT system to translate the monolingual corpus into synthetic bilingual corpus. With the synthetic bilingual corpus, we can build an SMT sy...

متن کامل

Statistical Machine Translation with a Small Amount of Bilingual Training Data

The performance of a statistical machine translation system depends on the size of the available task-specific bilingual training corpus. On the other hand, acquisition of a large high-quality bilingual parallel text for the desired domain and language pair requires a lot of time and effort, and, for some language pairs, is not even possible. Besides, small corpora have certain advantages like ...

متن کامل

A Probabilistic Model of Machine Translation

A probabilistic model for computer-based generation of a machine translation system on the basis of English-Russian parallel text corpora is suggested. The model is trained using parallel text corpora with pre-aligned source and target sentences. The training of the model results in a bilingual dictionary of words and " word blocks " with relevant translation probability 1 Introduction.

متن کامل

Bilingual Sense Similarity for Statistical Machine Translation

This paper proposes new algorithms to compute the sense similarity between two units (words, phrases, rules, etc.) from parallel corpora. The sense similarity scores are computed by using the vector space model. We then apply the algorithms to statistical machine translation by computing the sense similarity between the source and target side of translation rule pairs. Similarity scores are use...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011